refactor: single source of truth for agents, scope secrets per-org, webhook-only platforms#505
refactor: single source of truth for agents, scope secrets per-org, webhook-only platforms#505buremba wants to merge 9 commits into
Conversation
…LOCAL_BOOTSTRAP `ensureBootstrapPat` now self-skips when the user table is non-empty, making the previous LOBU_LOCAL_BOOTSTRAP=true env-flag gate redundant. Operators no longer need to opt in for first-run local dev: an empty PGlite data dir mints a single admin PAT (printed once, scope unchanged), while production deployments with provisioned users are protected by the new user-count guard. scripts/e2e-lobu-apply.sh stops setting the flag — the bootstrap PAT is now produced automatically.
…R BY The Socket Mode loop has always returned after the first connection with an appToken (`break` at line 130), making the previous `LIMIT 10` redundant without explaining why. Replace with `ORDER BY id` so which connection wins is deterministic, and add a doc comment naming the single-client limitation so multi-org operators reach for the Events API webhook transport. Behavior unchanged.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Triage decision: Reasons:
Next: Human review required to resolve the blocking status checks. Check the failing check-drift workflow and wait for integration tests to complete. |
The gateway no longer reads lobu.toml at boot. File-declared agents now enter Postgres exclusively via \`lobu apply\` (CLI) — there is one source of truth for agent state, addressed by \`organization_id\`. Deleted: - gateway/config/file-loader.ts - 5 file-loader test files - entryFromFileLoadedAgent helper - CoreServices: loadAgentConfigFromFiles boot path, populateStoreFromFiles, reloadFromFiles, onReloadFromFiles, getFileLoadedAgents, isFileFirstMode, getProjectPath - /api/v1/reload dev route - gateway/cli/gateway.ts file-loader-driven platform connection seeding Kept (still used by SDK-embedded mode via GatewayConfig.agents): - DeclaredAgentRegistry - InMemoryAgentStore (now only the SDK-embedded backing store) - entryFromAgentConfig + buildRegistryMap (signature simplified to take only configAgents) The system/manifest owner sentinel is gone — every agent in the gateway now has a real owner and a real organization_id. Bun typechecks clean repo-wide. Behavior change: gateway boot will now throw a clear error if neither host stores nor SDK config.agents are provided (previously it would have silently picked up lobu.toml).
Previously the agent_secrets table had a global \`(name)\` primary key, so two organizations storing the same logical secret (e.g. ANTHROPIC_API_KEY) would silently overwrite each other on conflict. - Add \`organization_id\` column with PK changed to \`(organization_id, name)\`. - New ondisk migration plus an embedded-schema patch so existing PGlite installs pick it up on next boot. - Empty-string organization_id is the legacy/global bucket — used by system env-store and any pre-existing rows. - PostgresSecretStore reads \`tryGetOrgId()\` from AsyncLocalStorage: per-org rows take precedence over the global bucket on read; writes go into the active org if context exists, else the global bucket; deletes and lists scope to the active org. - The \`secret://<name>\` URI scheme is unchanged — same URI, different value per org context. Stored refs across the codebase don't need migration.
…irst Previously a module-level \`socketModeClient\` singleton meant only the first agent_connections row with an appToken got a WebSocket — every other Slack workspace (especially across orgs) was silently dropped. Replace with a Map<connectionId, SocketModeClient> populated at gateway boot, and disconnect each client on graceful shutdown. Errors starting an individual client no longer take down the rest. Lifecycle hooks for connections added/removed after gateway boot are not wired here — restart the gateway to pick up newly added Slack rows. That hook is a follow-up; this PR closes the cross-org silent-drop bug.
…at SDK The Socket Mode bridge in lobu/gateway.ts ran a WebSocket per Slack connection, translating events into a synthetic POST against ChatInstanceManager.handleSlackAppWebhook. It existed only to support local dev without a public URL. WebSocket-based delivery is meaningfully less reliable than webhooks (reconnect drops, no retry semantics) and Slack's own docs steer production toward Events API. Keeping Socket Mode meant a per-platform runtime dependency (@slack/socket-mode) and a parallel transport story that diverged from every other platform. - Delete startSlackSocketMode + the socketModeClients map - Drop @slack/socket-mode from packages/owletto-backend and packages/cli - Remove the disconnect-all loop from stopLobuGateway - AGENTS.md: codify "one transport per platform — webhooks via the Chat SDK adapter; local dev for webhook-only platforms uses a tunnel" Reverts the work in 7a751bd (ORDER BY) and 8646802 (per-connection map) that only existed to make Socket Mode multi-tenant.
PostgresSecretStore now resolves the same secret:// ref to different
values per org context (via tryGetOrgId AsyncLocalStorage), but
SecretStoreRegistry was caching by ref alone. Org A populating the
cache would serve its plaintext to Org B for the 60s TTL window.
Cache key is now `${orgId}|${ref}`. put() invalidates the ref across
every org scope; delete() walks the cache and clears any org-prefixed
entry that matches the canonical name.
Adds a regression test using an OrgAwareStore fake that mirrors
PostgresSecretStore's per-org resolution.
TokenRefreshJob ran maybeRefresh() with no AsyncLocalStorage org scope,
so PostgresSecretStore.put() (called via AuthProfilesManager.upsertProfile)
silently routed rotated refresh tokens into the empty-string GLOBAL
bucket. Reads went to GLOBAL too — when a request later tried to use the
profile, it got the wrong (or no) credential.
Both call paths are affected:
- runOnce (scheduled every 30min via TaskScheduler)
- refreshForUserAgent (lazy at-use, spawned via the scheduler queue —
the original request's org scope is lost by the time the task runs)
Fix:
- scanAllOAuth() INNER JOINs agents to yield (userId, agentId, orgId).
- runOnce wraps each maybeRefresh in orgContext.run({organizationId}).
- refreshForUserAgent looks up the agent's org and wraps before
delegating; logs and skips when the agent row is gone (deleted).
Tests updated: scanAllOAuth fixture seeds backing agents rows, and a
new test covers the orphaned-profile skip path.
…igration - PostgresSecretStore: warn loudly when put()/delete() runs without an org context. System env-store reaches here intentionally; tenant code paths reaching here is a bug — the rotated secret would land in the cross-tenant GLOBAL bucket and shadow every org's read of the same name. Loud logs make the regression visible in observability. - start-local.ts: refuse to mint a bootstrap PAT unless dbUrl resolves to a loopback host (127.0.0.1 / localhost / ::1). The user-count guard is the second layer; this catches the case where someone reuses ensureBootstrapPat against a fresh prod DB before the first signup lands. - agent_secrets_org_scope migration: add a ROLLOUT NOTE explaining the non-atomic primary-key swap is safe for Lobu's single-process atomic-swap deployment but unsafe under rolling-deploy supervisors. - lobu/gateway.ts: drop unused getDb import left over from Socket Mode removal.
|
Pushed three fixup commits addressing the pi-cli review (#505 — earlier session findings):
Verification:
Reviewers: the architectural direction of #505 is right (PG single source of truth + org-scoped secrets + webhook-only platforms); the issues fixed here were correctness gaps that would have shown up as cross-tenant data leaks once OAuth refresh started rotating tokens at scale. |
Summary
Six commits cleaning up the multi-tenant setup so agents and secrets are properly scoped per organization, with a single delivery story for chat platforms.
Commits
60a7b21— Auto-bootstrap admin PAT, retireLOBU_LOCAL_BOOTSTRAPensureBootstrapPatself-skips when theusertable is non-empty. The env-flag gate is gone; production deployments are protected by the user-count guard.scripts/e2e-lobu-apply.shupdated.7a751bd+8646802— Slack Socket Mode iterations (superseded byab742bc)Two intermediate commits that first documented and then made the Socket Mode bridge multi-tenant. Both are neutralized by the final commit on the branch — included in history because they're already pushed; net effect is removal.
26a9d04— Remove gateway boot-timelobu.tomlfile-loaderThe big one. Gateway no longer reads
lobu.tomlat boot — file-declared agents enter Postgres exclusively vialobu apply. One source of truth.Deleted:
gateway/config/file-loader.ts, 5 file-loader tests,entryFromFileLoadedAgent, thesystem/manifestowner sentinel,loadAgentConfigFromFiles,populateStoreFromFiles,reloadFromFiles,onReloadFromFiles,getFileLoadedAgents,isFileFirstMode,getProjectPath,/api/v1/reloaddev route, file-loader-driven platform connection seeding.Kept (still used by SDK-embedded mode via
GatewayConfig.agents):DeclaredAgentRegistry,InMemoryAgentStore,entryFromAgentConfig,buildRegistryMap(signature simplified to take onlyconfigAgents).Behavior change: gateway boot throws a clear error if neither host stores nor SDK
config.agentsare provided.c75f748— Scopeagent_secretsbyorganization_idThe
agent_secretstable had a global(name)PK, so two orgs storing the same logical secret (e.g.ANTHROPIC_API_KEY) silently overwrote each other.organization_idcolumn with PK(organization_id, name). Migration + embedded-schema patch for existing PGlite installs.organization_idis the legacy/global bucket — used by system env-store and any pre-existing rows.PostgresSecretStorereadstryGetOrgId()from AsyncLocalStorage: per-org rows take precedence over global on read; writes go into the active org if context exists; deletes and lists scope to the active org.secret://<name>URI scheme is unchanged — same URI, different value per org context.ab742bc— Drop Slack Socket Mode entirely, stick to Chat SDK webhooksWebSocket transports are less reliable than webhooks (reconnect drops, no retry semantics) and Slack's own docs steer production toward Events API. Socket Mode existed only to save local dev users from running a tunnel.
startSlackSocketMode,socketModeClientsmap,@slack/socket-modefrom bothowletto-backendandclipackage.json.stopLobuGateway.Reverts the per-connection Socket Mode work in
7a751bdand8646802.Net effect
Read end-to-end, this PR:
LOBU_LOCAL_BOOTSTRAPenv-flag layer.lobu.tomlreader and the in-memory parallel agent store wiring.agent_secretsper organization.Lines net: ~750 removed, ~130 added.
Test plan
bun run typecheckpasses.declared-agent-registrytests pass (5/5).lobu runon a freshOWLETTO_DATA_DIRmints a bootstrap PAT withoutLOBU_LOCAL_BOOTSTRAP=true.lobu runon a data dir with existing users no longer mints a "Local Developer" user.scripts/e2e-lobu-apply.shstill passes end-to-end.agent_secretsmigration applies cleanly to an existing PGlite install (embedded patch path).ANTHROPIC_API_KEYin org A and a different value in org B no longer overwrites each other.DB-backed integration tests in this branch fail in the sandbox due to no reachable Postgres — they need a real DB to verify.
https://claude.ai/code/session_01GTYRB28aN7dKNiM5vcUaBp